Automatic Text Summarization Based on the Global Document Annotation

Authors

  • Katashi Nagao
  • Kôiti Hasida
Abstract

The GDA (Global Document Annotation) project proposes a tag set which allows machines to automatically infer the underlying semantic/pragmatic structure of documents. Its objectives are to promote development and spread of NLP/AI applications to render GDA-tagged documents versatile and intelligent contents, which should motivate WWW (World Wide Web) users to tag their documents as part of content authoring. This paper discusses automatic text summarization based on GDA. Its main features are a domain/style-free algorithm and personalization on summarization which reflects readers' interests and preferences. In order to calculate the importance score of a text element, the algorithm uses spreading activation on an intradocument network which connects text elements via thematic, rhetorical, and coreferential relations. The proposed method is flexible enough to dynamically generate summaries of various sizes. A summary browser supporting personalization is reported as well.

1 Introduction

The WWW has opened up an era in which an unrestricted number of people publish their messages electronically through their online documents. However, it is still very hard to automatically process contents of those documents. The reasons include the following:

1. HTML (HyperText Markup Language) tags mainly specify the physical layout of documents. They address very few content-related annotations.
2. Hypertext links cannot very much help readers recognize the content of a document.
3. The WWW authors tend to be less careful about wording and readability than in traditional printed media. Currently there is no systematic means for quality control in the WWW.

Although HTML is a flexible tool that allows you to freely write and read messages on the WWW, it is neither very convenient to readers nor suitable for automatic processing of contents.
We have been developing an integrated platform for document authoring, publishing, and reuse by combining natural language and WWW technologies. As the first step of our project, we defined a new tag set and developed tools for editing tagged texts and browsing these texts. The browser has the functionality of summarization and content-based retrieval of tagged documents. This paper focuses on summarization based on this system. The main features of our summarization method are a domain/style-free algorithm and personalization to reflect readers' interests and preferences. This method naturally outperforms the traditional summarization methods, which just pick out sentences highly scored on the basis of superficial clues such as word count, and so on.

2 Global Document Annotation

GDA (Global Document Annotation) is a challenging project to make WWW texts machine-understandable on the basis of a new tag set, and to develop content-based presentation, retrieval, question-answering, summarization, and translation systems with much higher quality than before. GDA thus proposes an integrated global platform for electronic content authoring, presentation, and reuse. The GDA tag set is based on XML (Extensible Markup Language), and designed as compatible as possible with HTML, TEI, EAGLES, and so forth. An example of a GDA-tagged sentence is as follows:

time flies like an arrow .

[...] means sentential unit. [...] and [...] mean noun [...]
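
The spreading-activation scoring mentioned in the abstract can be illustrated with a small sketch. It is not the authors' formulation, which this excerpt does not give: the update rule, the `decay` parameter, the link weights, and the names `spread_activation` and `select_elements` are all assumptions made for illustration. Text elements are nodes, relations (thematic, rhetorical, coreferential) are weighted undirected links, and each element's external input stands in for the reader's interest in it.

```python
from collections import defaultdict


def spread_activation(edges, external, decay=0.5, iterations=50, tol=1e-9):
    """Propagate activation over an undirected weighted network.

    edges:    iterable of (element_a, element_b, weight), weight > 0
    external: dict element -> constant external input (assumed here to
              model baseline importance or reader interest)
    Returns a dict element -> activation, used as an importance score.
    """
    neighbors = defaultdict(list)
    for a, b, w in edges:
        neighbors[a].append((b, w))
        neighbors[b].append((a, w))
    nodes = set(neighbors) | set(external)
    # Each node divides its activation among its links in proportion to
    # link weight, so the total stays bounded when decay < 1.
    total = {n: sum(w for _, w in neighbors[n]) for n in nodes}
    act = {n: external.get(n, 0.0) for n in nodes}
    for _ in range(iterations):
        new = {}
        for n in nodes:
            incoming = sum(act[m] * w / total[m] for m, w in neighbors[n])
            new[n] = external.get(n, 0.0) + decay * incoming
        delta = max(abs(new[n] - act[n]) for n in nodes)
        act = new
        if delta < tol:
            break
    return act


def select_elements(scores, order, k):
    """Keep the k highest-scoring elements, listed in document order."""
    top = set(sorted(scores, key=scores.get, reverse=True)[:k])
    return [e for e in order if e in top]


# Hypothetical four-element document; the links stand in for relations.
ORDER = ["s1", "s2", "s3", "s4"]
EDGES = [("s1", "s2", 1.0), ("s1", "s3", 1.0), ("s3", "s4", 1.0)]

if __name__ == "__main__":
    uniform = spread_activation(EDGES, {e: 1.0 for e in ORDER})
    print(select_elements(uniform, ORDER, 2))
    # Personalization: a reader interested in s4 raises its external input.
    personal = spread_activation(EDGES, {"s1": 1.0, "s2": 1.0, "s3": 1.0, "s4": 3.0})
    print(sorted(personal.items()))
```

Changing `k` yields summaries of different sizes from the same scores, and changing the external inputs re-ranks elements for a particular reader, which is the kind of flexibility the abstract describes.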

Similar resources

A survey on Automatic Text Summarization

Text summarization endeavors to produce a summary version of a text while maintaining the original ideas. The textual content on the web, in particular, is growing at an exponential rate. The ability to sift through such a massive amount of data in order to extract the useful information is a major undertaking and requires an automatic mechanism to aid with the extant repository of informa...

Systematic literature review of fuzzy logic based text summarization

"Information overload" is not a new term, but the massive development in technology, which enables anytime, anywhere, easy and unlimited access to, participation in, and publishing of information, has escalated its impact. Assisting users' informational searches with reduced reading and surfing time by extracting and evaluating accurate, authentic, and relevant information are the primary c...

Biogeography-Based Optimization Algorithm for Automatic Extractive Text Summarization

Given the increasing number of documents, sites, and online sources, and users' desire to access information quickly, automatic text summarization has caught the attention of many researchers in this field. Researchers have presented different methods for text summarization, as well as for producing a useful summary of those texts that includes the relevant document sentences. This study select...

Text Summarization Using Cuckoo Search Optimization Algorithm

Today, with the rapid growth of the World Wide Web and the creation of Internet sites and online text resources, text summarization has attracted the attention of many researchers. Extractive text summarization is an important summarization method which consists of selecting the top representative sentences from the input document. When we face documents with large data volumes, the extr...

EXTRACTION-BASED TEXT SUMMARIZATION USING FUZZY ANALYSIS

Due to the explosive growth of the world-wide web, automatic text summarization has become an essential tool for web users. In this paper we present a novel approach for creating text summaries. Using fuzzy logic and word-net, our model extracts the most relevant sentences from an original document. The approach utilizes fuzzy measures and inference on the extracted textual information from the docu...

A new sentence similarity measure and sentence based extractive technique for automatic text summarization

The technology of automatic document summarization is maturing and may provide a solution to the information overload problem. Nowadays, document summarization plays an important role in information retrieval. With a large volume of documents, presenting the user with a summary of each document greatly facilitates the task of finding the desired documents. Document summarization is a process of...

Publication date: 1998